A baseline for content-based blog classification

Authors

  • Olof Görnerup
  • Magnus Boman
Abstract

We present a content-based network representation of web logs (blogs) that uses a basic word-overlap similarity measure. Because blog data carry a strong signal, this simple approach is sufficient to classify blogs accurately. Using Swedish blog data, we demonstrate that blogs treating similar subjects are organized in clusters that are, in turn, hierarchically organized in higher-order clusters. The simplicity of the representation renders it both computationally tractable and transparent. We therefore argue that the approach is suitable as a baseline when developing and analyzing more advanced content-based representations of the blogosphere.
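
As a rough illustration of what a word-overlap network could look like, the sketch below links blogs whose vocabularies overlap. The abstract does not specify the exact measure, so the Jaccard index used here is an assumption, as are the function names (`tokenize`, `word_overlap`, `build_network`), the `threshold` parameter, and the toy blog texts; none of them come from the paper.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(a, b):
    """Jaccard overlap of two token sets (an assumed choice of measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_network(blogs, threshold=0.1):
    """Return {(name_u, name_v): similarity} for pairs at or above threshold.

    `blogs` maps a blog name to its concatenated text. The threshold is a
    hypothetical parameter used to keep only the stronger links.
    """
    tokens = {name: tokenize(text) for name, text in blogs.items()}
    edges = {}
    for u, v in combinations(sorted(tokens), 2):
        similarity = word_overlap(tokens[u], tokens[v])
        if similarity >= threshold:
            edges[(u, v)] = similarity
    return edges


# Toy data, invented for illustration only.
blogs = {
    "knitting_a": "wool yarn needles pattern sweater",
    "knitting_b": "yarn pattern scarf wool stitches",
    "politics_a": "election parliament vote party budget",
}
edges = build_network(blogs)
```

In this toy example only the two knitting blogs end up linked, since they share three of seven distinct words while the politics blog shares none. Clusters of similar blogs could then be sought in the resulting weighted graph with any standard community-detection method.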

Similar resources

Blog Posts and Comments Extraction and Impact on Retrieval Effectiveness

This paper is focused on the extraction of certain parts of a blog: the post and the comments, presenting a technique based on the blog structure and its elements attributes, exploiting similarities and conventions among different blog providers or Content Management Systems (CMS). The impact of the extraction process over retrieval tasks is also explored. Separate evaluation is performed for b...

Enhancing Concept Based Modeling Approach for Blog Classification

Blogs are user-generated content discussing various topics. Over the past 10 years, social web content has grown at a fast pace, and research projects are finding ways to channel this information using text classification techniques. Existing classification techniques follow only boolean (or crisp) logic. This paper extends our previous work with a framework where fuzzy clustering is o...

Learning To Recognize Blogs: A Preliminary Exploration

We present results of our experiments with the application of machine learning on binary blog classification, i.e. determining whether a given web page is a blog page. We have gathered a corpus in excess of half a million blog or blog-like pages and pre-classified them using a simple baseline. We investigate which algorithms attain the best results for our classification problem and experiment ...

Multiple Ranking Strategies for Opinion Retrieval in Blogs - The University of Amsterdam at the 2006 TREC Blog Track

We describe our participation in the Opinion Retrieval task at TREC 2006. Our approach to identifying opinions in blog posts consisted of scoring the posts separately on various aspects associated with an expression of opinion about a topic, including shallow sentiment analysis, spam detection, and link-based authority estimation. The separate approaches were combined into a single ranking, yiel...

A public opinion classification algorithm based on micro-blog text sentiment intensity: Design and implementation

Based on the features of short content and nearly real-time broadcasting velocity of micro-blog information, our lab constructed a public opinion corpus named MPO Corpus. Then, based on an analysis of the status of network public opinion, this paper proposes an approach to calculate sentiment intensity at three levels: words, sentences, and documents. Furthermore, on the...

Journal:
  • CoRR

Volume: abs/0909.4416

Publication year: 2009